Analysis of MRI and CT-based radiomics features for personalized treatment in locally advanced rectal cancer and external validation of published radiomics models

Radiomics analyses commonly apply imaging features of different complexity for the prediction of the endpoint of interest. However, the prognostic value of each feature class is generally unclear. Furthermore, many radiomics models lack independent external validation that is decisive for their clinical application. Therefore, in this manuscript we present two complementary studies. In our modelling study, we developed and validated different radiomics signatures for outcome prediction after neoadjuvant chemoradiotherapy (nCRT) in patients with locally advanced rectal cancer (LARC) based on computed tomography (CT) and T2-weighted (T2w) magnetic resonance (MR) imaging datasets of 4 independent institutions (training: 122, validation 68 patients). We compared different feature classes extracted from the gross tumour volume for the prognosis of tumour response and freedom from distant metastases (FFDM): morphological and first order (MFO) features, second order texture (SOT) features, and Laplacian of Gaussian (LoG) transformed intensity features. Analyses were performed for CT and MRI separately and combined. Model performance was assessed by the area under the curve (AUC) and the concordance index (CI) for tumour response and FFDM, respectively. Overall, intensity features of LoG transformed CT and MR imaging combined with clinical T stage (cT) showed the best performance for tumour response prediction, while SOT features showed good performance for FFDM in independent validation (AUC = 0.70, CI = 0.69). In our external validation study, we aimed to validate previously published radiomics signatures on our multicentre cohort. We identified relevant publications on comparable patient datasets through a literature search and applied the reported radiomics models to our dataset. Only one of the identified studies could be validated, indicating an overall lack of reproducibility and the need of further standardization of radiomics before clinical application.


Grade
Description Complete regression (TRG 4) No tumour cells Near complete regression (TRG 3) Very few tumour cells Moderate regression (TRG 2) Dominantly fibrotic changes with few tumour cells or groups Minimal regression (TRG 1) Dominant tumour mass with obvious fibrosis No regression (TRG 0) No regression  Figure S1: Image preprocessing and feature extraction pipeline. Magnetic resonance (MR) images were preprocessed and the gross tumour volume (GTV) was delineated centrally by one experienced radiation oncologist and one radiologist. GTV contours were then transferred to treatment planning computed tomography (CT) after rigid registration. All features were extracted from the GTV on the original and the Laplacian of Gaussian (LoG) transformed CT and MR images using a 3D approach.  Figure S2: Modelling study workflow. After image preprocessing, radiomic features were extracted from pre-treatment T2w magnetic resonance imaging (MRI) and treatment planning computed tomography (CT) and analysed for robustness. Features were separated into morphological and first-order (MFO), second-order texture (SOT), and Laplacian of Gaussian transformed (LOG) features. Also, features in each modality were analysed without any separation to feature classes represented here by 'All'. Clustering was performed and four radiomic signatures were created individually for T2w MRI and CT based on (i) MFO, (ii) SOT, (iii) LoG, and (iv) all features using a cross-validation approach and validated independently for both endpoints. Once these signatures were developed four joint MRI and CT signatures in each feature category (i) to (iv) for both endpoints were validated with and without adding significant clinical features

Section 1: Ranking scheme for feature selection
Here we explain an example of feature selection for LoG features for tumour response prediction. The same technique applies to FFDM prediction as well. Table S4 shows fourteen Laplacian of Gaussian transformed (LoG) MRI features and fifteen CT features with the highest mutual information with tumour response selected after hierarchical clustering. These features were then used to build a prognostic model. Feature selection and model building with internal validation was first performed within 33 repetitions of 3-fold cross-validation (CV) nested in the training dataset to identify an optimal signature. Four supervised feature-selection algorithms were considered: minimal redundancy maximum relevance (MRMR) [1], mutual information maximization (MIM) [2], Elastic-net (EN) [3], and univariate logistic regression (LR). To avoid potential overfitting, only the five most relevant features were selected in each cross-validation fold. These features were then used to build a multivariable logistic-regression model on the internal training part, and validated on the internal validation part. For each of the abovementioned feature selection methods, the occurrence of every feature in the 99 modelling steps was counted and features were ranked according to their occurrences across the cross-validation folds. Table S5 shows features with ≥50 % occurrence across each feature selection method that were further considered. Finally, features that showed repeated occurrences across at least 75% of the feature selection methods were selected (MR_log_ih_max_grad_fbn_n32 and MR_log_stat_min, CT_log_ih_max_grad_fbn_n32 and CT_log_stat_energy, Table S5). MR_log_ih_max_grad_fbn_n32 showed the highest cumulative occurrence (i.e. the highest sum of occurrences across all feature selection methods) of 365, while MR_log_stat_min showed a cumulative occurrence of 251. Both features showed a Spearman correlation of <0.5 on the entire training cohort, thus forming the MR-based LOG radiomic signature. A model with this signature was then fitted on the entire training data and the trained model was applied to the external validation data. Similarly, for CT data, CT_log_ih_max_grad_fbn_n32 showed the highest cumulative occurrence of 363, while CT_log_stat_energy showed a cumulative occurrence of 277. The features were strongly correlated with a Spearman correlation >0.5. Therefore, only CT_log_ih_max_grad_fbn_n32 was considered for the final CT-based LOG radiomic signature. The final performance in internal cross validation was considered as the average of the cross-validation training AUC (CV training) and validation AUC (CV validation). The finally selected signature and the average AUC in internal training and external validation are shown in Table S6.         Table 3 of the manuscript.

Included studies
The prospective single centre study by De Cecco et al. [4,5] extracted first-order intensity features from the tumour ROI delineated on the largest slice for the prediction of pathological complete responders (TRG=4) and non-responders (TRG=0-3). Images were transformed using the SSF (Laplacian of Gaussian) filter. The study reported SSF4 kurtosis to be significant in predicting response groups. To replicate this study, we extracted features from the largest tumour slices in our pooled cohort. Images were transformed using LoG filter and validation was performed for stat_kurt (IBSI: IPH6) using Wilcoxon rank-sum test and computing AUC. We adapted TRG status to match the study (TRG=4 vs TRG<4 following Dworak et al. [6]).
The study by Chidambaram et al. [7] was not strictly radiomics-based, as no high dimensional features were extracted from imaging data. However, this study analysed some basic morphological and statistical features for tumour response prediction using ADC maps of T2w MRI. Pre-treatment MRI volume was found to be significantly associated with tumour response (complete responders vs incomplete\non-responders following AJCC). We extracted morph_vol (IBSI: RNU0) from 3D GTV on the pooled cohort and analysed its significance via the Mann-Whitney-U test for tumour response (TRG=4 vs TRG<4 following Dworak et al. [6]).
In a retrospective study conducted by Caruso et al. [8] on a small cohort of 8 patients, the directional GLCM (no discretization mentioned) features extracted from T2w MRI were shown to be significantly different between pathological complete and incomplete responders, while partial responders were excluded from the study. Logistic regression model followed by Wald test for feature importance was used to analyse the predictive performance of features in the model. For validation, we excluded partial responders from our pooled training and validation data thus including only 65 patients (TRG=4 vs TRG=0 following Dworak et al. [6]). Directional features are not supported by MIRP thus we extracted 2D GLCM features using the average method, i.e. features were computed from all matrices and then averaged using fixed bin number 64. Validation was performed by fitting a multivariable logistic regression model followed by the Wald test to obtain p-values for each feature. IBSI synonyms for validated features are mentioned in Table S10.
Two multicentre retrospective studies (Cusumano et al. [9] and Dinapoli et al. [10]) have proposed statistical and intensity histogram features extracted from LoG transformed images together with clinical T and N stage for tumour response prediction (TRG=1 vs TRG>1, Mandard et al. [11]). The signature presented by Cusumano et al. also included an additional fractal feature that could not be extracted by MIRP. The studies did not report discretization for intensity histogram features. Final model coefficients were reported in the studies. To validate the study by Cusumano et al. pixel intensities inside GTV were normalized by 99 th percentile. The fractal feature was excluded, and the remaining features were extracted from 2D slices from discretized intensity histogram (25 bins). The model provided in each study was then applied on our pooled cohort. Details of features and their corresponding IBSI synonyms are presented in Table S10.
The study by Meng et al. [12] analysed statistical and intensity histogram features extracted from the largest tumour slices for tumour response prediction. The image intensities were discretized. However, the study did not report the number of bins used for discretization. Further response groups were created as responders (TRG=1-2) and non-responders (TRG=3-5) following Mandard et al. [11]. For validation, we extracted intensity histogram features from the largest tumour slices after discretizing image intensities into 25 uniform bins from pooled cohort. Finally, ih_kurt (IBSI: C3I7) feature was tested for tumour response prediction (TRG=3-4 vs TRG=0-2, Dworak et al. [6]) using the Mann-Whitney-U test.
The study by Cui et al. [13] used mpMRI to develop a radiomic signature. However, a standalone T2w MRI signature comprising directional GLCM, statistical, and morphological features extracted from the GTV for tumour response prediction (TRG=1 vs TRG>1 following Mandard et al. [11]) was also presented. The study did not report discretization, merge method, and feature extraction plane, i.e. 2D\3D for GLCM features. The study reports the coefficient of the final model built on zscore normalized features. Since directional features are not supported by MIRP, we extracted 3D GLCM features with a fixed bin number of 64 using the average method, i.e. features were computed from all matrices and then averaged, thus compensating for directional texture features. We then applied the model coefficients provided in study to compute the Radscore and AUC on our pooled cohort. Since we can get only closely related features for this study, we also fitted the logistic regression model on the train data and applied it to the validation dataset. However, this validation was not successful (AUC: train\valid = 0.72\0.32). We excluded 'HaralickCorrelation_angle90_offset7' from the signature, as by definition Haralick features are no different from GLCM 'Correlation_angle135_offset7' except for the difference in angle [14].
A retrospective study conducted by Antunes et al. [15] has reported 4 (1 Haralick co-occurrence,2 Gradient organization, 1 Laws energy response) features extracted from the largest tumour slice for tumour response prediction (TRG=4 vs TRG<4 following Dworak et al. [6]). Pre-processing applied before feature extraction in this study includes (i) image interpolation 0.781 × 0.781 × 4.0 mm, (ii) N4 bias correction, and (iii) intensity normalization reference to the mean intensity of the obturator internus muscle. To validate this study, we replicated steps (i) and (ii) of pre-processing, while for step (iii) relative range intensity normalization [0, 90] was performed. Furthermore, gradient organization features are not IBSI compliant. Therefore, none of the organization features could be extracted. Finally, a random forest model was created on training data using one feature, i.e. Skewness-Laws Wave-Ripple ws = 5, and subsequently transferred to the validation data. Feature importance was computed via the Mann-Whitney-U test.
Petkovska et al. [16] has reported 6 features (2 morphological, 2 grey level texture, 2 directional Gabor) extracted from the 3D tumour volume using T2w MRI. In the study, images and corresponding tumour mask were interpolated 1 × 1 × 1 mm prior to feature extraction. The study reported discretisation of normalized image intensities using fixed bin size=128, however normalization step was not clearly explained. To unambiguously specify Gabor filters at least two out of three parameters are required (scale (sigma), wavelength (lambda), bandwidth). The study reported only sigma values for Gabor filters. Model coefficient were provided in study for T2w signature. In our validation analysis we interpolated MR images to isotropic 1mm resolution using cubic interpolation followed by standard normalization of image intensities within soft tissue region. Grey level intensities were discretized using fixed bin size=128 for texture features. Gabor transformed features were extracted using angle and sigma values as reported in study however we used lambda=4 to complete feature extraction. Finally Radscore was computed by applying model coefficients using our pooled cohort and subsequently AUC was computed for tumour response prediction (TRG=4 vs TRG <4 following Dworak et al. [6]). Figure S7 shows the calibration plot for validation. Figure S7: Calibration plots for the study by Petkvoska et al. [16] A retrospective single-center study conducted by Petresc et al. [17] proposed a signature comprising second-order texture features on LoG and wavelet transformed images to predict tumour response (TRG=3 vs TRG=1,2 following Rayan et al.). Image intensities within the GTV were discretized using a fixed-bin width of 5. However, discretization of wavelet features, merge method for texture features, and neighbourhood distance for GLCM features was not documented. Model coefficients were provided in the study. In our validation analysis, we applied pre-processing steps as indicated in the study including image standardization (mean=0, std=100), B-spline interpolation, re-segmentation of segmentation mask. For feature extraction we used fixed bin number=64, fixed bin size=5, merge method=average, and GLCM neighbourhood distance=1. Further, we exclude the 'wavelet_hhl_glcm_MCC' feature as it is not standardized by IBSI. For the remaining features, we computed a Radscore by applying model coefficients on z-score normalized features using our pooled cohort and subsequently computed the AUC for tumour response prediction (TRG=3,4 vs TRG=0-2 following Dworak et al. [6]).